Geocoding in Python lets you turn tabular datasets that contain geographic information into files that can be mapped and analyzed in GIS software. For example, suppose you have cleaned and analyzed a dataset of stops and frisks by New York City police in Python and would like to share it with a collaborator who is not a Python user. You will need to provide a file that can be opened in GIS software such as ArcGIS or QGIS. You can use the GeoDataFrame() constructor in the geopandas package to convert a pandas DataFrame containing coordinate columns (such as latitude and longitude) into a GeoDataFrame with a geometry column, and then save it as a shapefile with the to_file() method. Here is the data for this exercise.
If you already have Anaconda downloaded and installed, you can skip Part A (installation and setup) and start directly with the analysis in Part B. Make sure the pandas, geopandas, and matplotlib packages are installed in the environment where you will conduct this analysis.
1) First, download Anaconda, a free and open-source distribution of Python. You can use Anaconda to install IDEs (integrated development environments, where you write and run code) and packages such as pandas and geopandas. Download it from https://www.anaconda.com/products/individual, then open the downloaded installer (an .exe file on Windows) and follow the instructions in the installation wizard.
2) Once installation is complete, open Anaconda Navigator and create a new environment for your project. A Conda environment is a directory that contains a specific collection of Conda packages that you have installed. Conda has a default environment called 'base' that includes a Python installation and some core system libraries and dependencies of Conda. It is a “best practice” to avoid installing additional packages into your base environment, and, instead, create an isolated environment to manage packages and dependencies in a new project.
Click Environments in the left sidebar menu and then click 'Create' at the bottom. A dialog box opens and prompts you to name the new environment. Any name will do; here, we use 'GIS_in_Python'. Then click the 'Create' button within the dialog box to finish.
3) Once you have your project environment set up, click on the arrow to the right of your new environment, 'GIS_in_Python' in this example, and select Open Terminal. This will give you access to the command line interface on your computer in a window.
4) Install the packages/libraries necessary for the analysis by entering the following commands in the opened terminal, one line at a time:
conda install pandas
conda install geopandas
conda install matplotlib
5) Once you have those libraries all installed, select the new environment, 'GIS_in_Python' in this example, in the 'Applications on' dropdown menu, and then click "install" and "launch" under Jupyter Notebook. Jupyter Notebook will open in your web browser (it does not require the internet to work).
6) In Jupyter Notebook, navigate to the folder where you saved the code file you plan to use and open the .ipynb file (the extension for Jupyter Notebook files written in Python) to run it in the Notebook. If you would like to create a new .ipynb file, browse to the folder in which you would like to save your Notebook, then click the "New" dropdown button on the top-right and select "Python 3". Your new Notebook will open in a new tab in your browser. If you want to create a new directory using the Jupyter Notebook dashboard, click the "New" dropdown button and then select "Folder". To add files from your local machine, click the "Upload" button on the top-right to open a file chooser window and then choose the file you wish to upload.
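Once a notebook is open, the setup from the steps above can be checked with a short cell. This is only a minimal sketch using the Python standard library; the helper name missing_packages is hypothetical and not part of Anaconda or Jupyter.

```python
import importlib.util

# Packages this tutorial relies on.
REQUIRED = ["pandas", "geopandas", "matplotlib"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Not installed in this environment:", ", ".join(missing))
else:
    print("All required packages are available.")
```

If a package is reported as missing, the notebook is probably running in a different environment from the one where the conda install commands were entered.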
1) Import necessary packages/libraries.
import pandas as pd
import matplotlib.pyplot as plt
import geopandas as gpd
from geopandas import GeoDataFrame
from geopandas import points_from_xy
2) Use the read_csv() function from the pandas package to read the tabular dataset that contains the location coordinates you would like to share or analyze further before sharing. The coordinates usually come as a latitude field and a longitude field; in this dataset they are stored in the STOP_LOCATION_X and STOP_LOCATION_Y columns. Optionally, use the head() method to display the first 5 rows of the DataFrame.
df = pd.read_csv("sqf-2019.csv")
df.head()
| | STOP_ID_ANONY | STOP_FRISK_DATE | STOP_FRISK_TIME | YEAR2 | MONTH2 | DAY2 | STOP_WAS_INITIATED | RECORD_STATUS_CODE | ISSUING_OFFICER_RANK | ISSUING_OFFICER_COMMAND_CODE | ... | STOP_LOCATION_PRECINCT | STOP_LOCATION_SECTOR_CODE | STOP_LOCATION_APARTMENT | STOP_LOCATION_FULL_ADDRESS | STOP_LOCATION_STREET_NAME | STOP_LOCATION_X | STOP_LOCATION_Y | STOP_LOCATION_ZIP_CODE | STOP_LOCATION_PATROL_BORO_NAME | STOP_LOCATION_BORO_NAME |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1/2/19 | 14:30:00 | 2019 | January | Wednesday | Based on C/W on Scene | APP | POM | 1 | ... | 1 | C | (null) | 230 VESEY STREET | VESEY STREET | 979667 | 199737 | (null) | PBMS | MANHATTAN |
| 1 | 2 | 1/8/19 | 2:30:00 | 2019 | January | Tuesday | Based on Self Initiated | APP | POM | 1 | ... | 1 | C | (null) | 9 WHITE STREET | WHITE STREET | 982650 | 201326 | (null) | PBMS | MANHATTAN |
| 2 | 3 | 1/12/19 | 16:54:00 | 2019 | January | Saturday | Based on Radio Run | APP | POM | 1 | ... | 1 | D | (null) | 131 SPRING STREET | SPRING STREET | 984063 | 203033 | (null) | PBMS | MANHATTAN |
| 3 | 4 | 1/14/19 | 21:21:00 | 2019 | January | Monday | Based on Radio Run | APP | POM | 1 | ... | 1 | ( | (null) | GRAND STREET && 6TH AVE | GRAND STREET | 982848 | 202677 | (null) | PBMS | MANHATTAN |
| 4 | 5 | 1/15/19 | 18:50:00 | 2019 | January | Tuesday | Based on Radio Run | APP | POM | 1 | ... | 1 | D | (null) | 32 THOMPSON STREET | THOMPSON STREET | 983100 | 202705 | (null) | PBMS | MANHATTAN |
5 rows × 83 columns
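Before building geometry, it can help to confirm that the coordinate columns are numeric and complete. The sketch below assumes that some rows might carry the "(null)" placeholder seen in other columns of the preview; whether the real file has such rows in its coordinate columns is not confirmed here. The three-row sample frame and the helper name clean_coordinates are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the stop-and-frisk table; the real file has 83 columns.
sample = pd.DataFrame({
    "STOP_ID_ANONY": [1, 2, 3],
    "STOP_LOCATION_X": ["979667", "982650", "(null)"],
    "STOP_LOCATION_Y": ["199737", "201326", "(null)"],
})


def clean_coordinates(frame, x_col="STOP_LOCATION_X", y_col="STOP_LOCATION_Y"):
    """Return a copy with numeric coordinates, dropping rows that lack either one."""
    out = frame.copy()
    # errors="coerce" turns anything non-numeric, such as "(null)", into NaN.
    out[x_col] = pd.to_numeric(out[x_col], errors="coerce")
    out[y_col] = pd.to_numeric(out[y_col], errors="coerce")
    return out.dropna(subset=[x_col, y_col]).reset_index(drop=True)


cleaned = clean_coordinates(sample)
print(len(sample) - len(cleaned), "row(s) dropped for missing coordinates")
```

Applied to the tutorial's data, the same call would be clean_coordinates(df), with the result used in place of df in the next step.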
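3) Create a GeoDataFrame by using the geopandas points_from_xy() function to turn the x and y coordinate columns into shapely Point objects, and set the result as the geometry:
In GeoDataFrame(df, ...), df is the name of the standard DataFrame. The two arguments to points_from_xy() are the pandas Series that hold the x coordinates (longitude) and the y coordinates (latitude), in that order. In this dataset, STOP_LOCATION_X and STOP_LOCATION_Y are not degrees of longitude and latitude but projected coordinates in feet, which is why the crs argument is set to "EPSG:2263" (NAD83 / New York Long Island).
gdf = GeoDataFrame(df, geometry=points_from_xy(df['STOP_LOCATION_X'], df['STOP_LOCATION_Y']), crs="EPSG:2263")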
gdf.head()
| | STOP_ID_ANONY | STOP_FRISK_DATE | STOP_FRISK_TIME | YEAR2 | MONTH2 | DAY2 | STOP_WAS_INITIATED | RECORD_STATUS_CODE | ISSUING_OFFICER_RANK | ISSUING_OFFICER_COMMAND_CODE | ... | STOP_LOCATION_SECTOR_CODE | STOP_LOCATION_APARTMENT | STOP_LOCATION_FULL_ADDRESS | STOP_LOCATION_STREET_NAME | STOP_LOCATION_X | STOP_LOCATION_Y | STOP_LOCATION_ZIP_CODE | STOP_LOCATION_PATROL_BORO_NAME | STOP_LOCATION_BORO_NAME | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1/2/19 | 14:30:00 | 2019 | January | Wednesday | Based on C/W on Scene | APP | POM | 1 | ... | C | (null) | 230 VESEY STREET | VESEY STREET | 979667 | 199737 | (null) | PBMS | MANHATTAN | POINT (979667.000 199737.000) |
| 1 | 2 | 1/8/19 | 2:30:00 | 2019 | January | Tuesday | Based on Self Initiated | APP | POM | 1 | ... | C | (null) | 9 WHITE STREET | WHITE STREET | 982650 | 201326 | (null) | PBMS | MANHATTAN | POINT (982650.000 201326.000) |
| 2 | 3 | 1/12/19 | 16:54:00 | 2019 | January | Saturday | Based on Radio Run | APP | POM | 1 | ... | D | (null) | 131 SPRING STREET | SPRING STREET | 984063 | 203033 | (null) | PBMS | MANHATTAN | POINT (984063.000 203033.000) |
| 3 | 4 | 1/14/19 | 21:21:00 | 2019 | January | Monday | Based on Radio Run | APP | POM | 1 | ... | ( | (null) | GRAND STREET && 6TH AVE | GRAND STREET | 982848 | 202677 | (null) | PBMS | MANHATTAN | POINT (982848.000 202677.000) |
| 4 | 5 | 1/15/19 | 18:50:00 | 2019 | January | Tuesday | Based on Radio Run | APP | POM | 1 | ... | D | (null) | 32 THOMPSON STREET | THOMPSON STREET | 983100 | 202705 | (null) | PBMS | MANHATTAN | POINT (983100.000 202705.000) |
5 rows × 84 columns
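If your collaborator expects latitude and longitude in decimal degrees, the GeoDataFrame can be reprojected with the to_crs() method. The sketch below uses a hypothetical two-row GeoDataFrame built from the first two stops in the preview; choosing EPSG:4326 (WGS 84) as the target is an assumption about what the collaborator needs, not a step from the original workflow.

```python
import geopandas as gpd

# Two stops from the preview, in New York Long Island State Plane feet.
points = gpd.GeoDataFrame(
    {"STOP_ID_ANONY": [1, 2]},
    geometry=gpd.points_from_xy([979667, 982650], [199737, 201326]),
    crs="EPSG:2263",
)

# to_crs() returns a new GeoDataFrame; the original is left unchanged.
points_wgs84 = points.to_crs("EPSG:4326")

# In the reprojected frame, x holds longitude and y holds latitude (degrees).
points_wgs84["longitude"] = points_wgs84.geometry.x
points_wgs84["latitude"] = points_wgs84.geometry.y
print(points_wgs84[["STOP_ID_ANONY", "longitude", "latitude"]])
```

The same pattern would apply to the full dataset as gdf.to_crs("EPSG:4326").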
4) Optionally, you can use matplotlib for plotting to generate an overview of your GeoDataFrame:
figsize=(12, 12) sets the size of the figure in inches: the first number is the width (the x axis) and the second is the height (the y axis). ax=ax tells plot() which axes to draw on.
fig, ax = plt.subplots(figsize=(12, 12))
gdf.plot(ax=ax)
<AxesSubplot:>
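With many points, the default plot can become a solid blob. A possible variation, sketched here on a hypothetical three-row GeoDataFrame, uses smaller, semi-transparent markers, labels the axes, and saves the figure to disk. The marker settings and the file name sqf_overview.png are illustrative choices, not part of the original workflow.

```python
import os

import geopandas as gpd
import matplotlib.pyplot as plt

# Three stops from the preview, in EPSG:2263 (feet).
demo = gpd.GeoDataFrame(
    {"STOP_ID_ANONY": [1, 2, 3]},
    geometry=gpd.points_from_xy(
        [979667, 982650, 984063], [199737, 201326, 203033]
    ),
    crs="EPSG:2263",
)

fig, ax = plt.subplots(figsize=(12, 12))
# Small, semi-transparent markers keep dense areas readable.
demo.plot(ax=ax, markersize=5, alpha=0.5)
ax.set_title("Stop locations (EPSG:2263)")
ax.set_xlabel("x (ft)")
ax.set_ylabel("y (ft)")

# Write the figure next to the notebook so it can be shared as an image.
fig.savefig("sqf_overview.png", dpi=150)
print("Saved:", os.path.abspath("sqf_overview.png"))
```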
5) Use the to_file() method to write the GeoDataFrame as a shapefile. The argument passed to the method is the name we want to give to the file that will be written to our disk along with its file extension (since we want the file to be written as a shapefile, we use the .shp extension).
gdf.to_file("sqf_2019.shp")
At this point, "sqf_2019.shp" has been written to the directory where your .ipynb file is saved. A shapefile is stored as several companion files with the same base name (such as .shx, .dbf, and .prj), so share all of them with your collaborator, who can then open the shapefile in ArcGIS or QGIS. If you are a GIS user, you may wish to open ArcGIS or QGIS and import the newly written shapefile to verify that everything looks as expected.